Natural Language Processing (NLP) Project

Data Description

This is a sentiment analysis task about the problems of each major U.S. airline. Twitter data was scraped from February 2015, and contributors were asked first to classify tweets as positive, negative, or neutral, and then to categorize the reasons for negative tweets (such as "late flight" or "rude service").

1. Data Summary

Observations

There are 14,640 rows of data and 15 columns

Observations

Each row holds an individual tweet along with other information about it, especially its sentiment.

Columns include:

- **Index** which is the index column of the dataset
- **tweet_id** which is the ID # of the tweet
- **airline_sentiment** which is a classification of whether the tweet was positive, negative, or neutral about the airline
- **airline_sentiment_confidence** which is a measure of how likely the sentiment classification is to be accurate
- **negativereason** which is the category given for why the tweet was negative about the airline (such as "late flight" or "rude service")
- **negativereason_confidence** which is a measure of how likely the negative reason is to be accurate
- **airline** which is the airline the tweet was made about
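- **airline_sentiment_gold** whose meaning is unclear to me (the name suggests a gold-standard sentiment label, but the data description does not say)
- **name** which is the username of the user who tweeted
- **negativereason_gold** whose meaning is unclear to me (the name suggests a gold-standard negative reason, but the data description does not say)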
- **retweet_count** which is the number of times the tweet was retweeted by other users
- **text** which is the text of the tweet
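- **tweet_coord** whose meaning I'm not sure of (the name suggests the geographic coordinates of the tweet)
- **tweet_created** which is the date the tweet was made
- **tweet_location** whose meaning I'm not sure of (the name suggests the location associated with the tweet or its author)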
- **user_timezone** which is the timezone of the person who made the tweet

Observations

2. Exploratory Data Analysis (EDA)

Observations

Distribution of Tweets by Airline

Observations

Distribution of Tweet Sentiment by Airline
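
The notebook's plotting code is not shown here. As a minimal sketch, the counts behind a sentiment-by-airline chart can be tabulated with a pandas cross-tabulation. The toy data frame below is made up for illustration; only the column names `airline` and `airline_sentiment` come from the dataset.

```python
import pandas as pd

# Toy stand-in for the tweets data frame; the rows are invented for illustration.
toy_df = pd.DataFrame({
    "airline": ["United", "United", "Delta", "Delta", "Delta"],
    "airline_sentiment": ["negative", "neutral", "positive", "negative", "negative"],
})

# Rows are airlines, columns are sentiment classes, cells are tweet counts.
sentiment_by_airline = pd.crosstab(toy_df["airline"], toy_df["airline_sentiment"])
print(sentiment_by_airline)
```

Calling `sentiment_by_airline.plot(kind="bar")` on the result would be one way to draw the grouped bars.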

Observations

Distribution of All Negative Reasons

Observations

Word Cloud Graphs of Tweets

Observations

3. Understanding of Data Columns

Observations

4. Data Pre-Processing

Noise Removal

Remove HTML Tags
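
The original cell is not included, so the following is only a standard-library sketch of this step. The helper name `strip_html` is hypothetical. It drops anything that looks like a tag and then decodes entities such as `&amp;`.

```python
import html
import re

def strip_html(text: str) -> str:
    """Remove tag-like spans, then decode HTML entities."""
    without_tags = re.sub(r"<[^>]+>", "", text)
    return html.unescape(without_tags)

print(strip_html("<b>Late</b> again &amp; no update"))  # Late again & no update
```

A regex is a rough tool for HTML. For messier markup, a real parser would be the safer choice.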

Observations

Remove URLs
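
A minimal sketch of URL removal, assuming links start with `http://`, `https://`, or `www.`. The pattern and the helper name `remove_urls` are illustrative choices, not the notebook's own code.

```python
import re

# Matches a scheme or "www." followed by everything up to the next whitespace.
URL_PATTERN = re.compile(r"(?:https?://|www\.)\S+")

def remove_urls(text: str) -> str:
    """Replace each URL with a space, then collapse repeated whitespace."""
    return " ".join(URL_PATTERN.sub(" ", text).split())

print(remove_urls("delayed again http://t.co/abc123 thanks"))  # delayed again thanks
```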

Observations

Replace Contractions
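
One way to sketch contraction replacement is a lookup table applied with a single regex. The mapping below is hypothetical and deliberately tiny, so a real run would need a much fuller list. It also only handles the straight apostrophe (`'`).

```python
import re

# Hypothetical mini-mapping for illustration only.
CONTRACTIONS = {
    "can't": "cannot",
    "won't": "will not",
    "didn't": "did not",
    "it's": "it is",
    "i'm": "i am",
}

_CONTRACTION_RE = re.compile(
    r"\b(" + "|".join(re.escape(key) for key in CONTRACTIONS) + r")\b",
    flags=re.IGNORECASE,
)

def expand_contractions(text: str) -> str:
    """Swap each known contraction for its expanded (lowercase) form."""
    return _CONTRACTION_RE.sub(lambda m: CONTRACTIONS[m.group(0).lower()], text)

print(expand_contractions("I can't believe it's late"))  # I cannot believe it is late
```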

Observations

Remove Numbers

Observations

Tokenization
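
The tokenizer used in the notebook is not shown. As a sketch, a regex tokenizer that keeps `@mentions`, `#hashtags`, and internal apostrophes together could look like this. A library tokenizer would split some of these cases differently.

```python
import re

def tokenize(text: str) -> list:
    """Split text into word tokens, keeping a leading @ or # attached."""
    return re.findall(r"[@#]?\w+(?:'\w+)*", text)

print(tokenize("@united flight was late, again!"))
# ['@united', 'flight', 'was', 'late', 'again']
```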

Observations

Normalization

Remove Special Characters

Observations

Remove Punctuation
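
Assuming the text is already a list of tokens at this point, punctuation can be stripped per token and empty leftovers dropped. The helper name `remove_punctuation` is hypothetical, and `string.punctuation` covers ASCII punctuation only.

```python
import string

def remove_punctuation(tokens: list) -> list:
    """Strip ASCII punctuation from each token and drop tokens left empty."""
    table = str.maketrans("", "", string.punctuation)
    stripped = (token.translate(table) for token in tokens)
    return [token for token in stripped if token]

print(remove_punctuation(["late", ",", "flight!", "..."]))  # ['late', 'flight']
```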

Observations

Remove Stop Words
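
A minimal sketch of stop-word filtering on a token list. The `STOP_WORDS` set here is a hypothetical handful of words, not the list the notebook used.

```python
# Hypothetical mini stop-word list for illustration only.
STOP_WORDS = {"the", "a", "an", "is", "was", "to", "and", "my", "i"}

def remove_stop_words(tokens: list) -> list:
    """Keep only tokens whose lowercase form is not a stop word."""
    return [token for token in tokens if token.lower() not in STOP_WORDS]

print(remove_stop_words(["My", "flight", "was", "not", "late"]))
# ['flight', 'not', 'late']
```

For sentiment work, it may be worth checking whether the chosen list contains negations such as "not" or "no", since removing them can flip the meaning of a tweet.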

Observations

Convert to Lowercase

Observations

Lemmatization

Observations

Convert Back to Text String

Join the words in each list to convert them back to a text string in the data frame, so that each row contains its data in text format.
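
The join described above can be sketched as follows, assuming the `text` column holds a list of tokens per row. The toy rows are made up.

```python
import pandas as pd

# Toy frame where each row of "text" is a list of cleaned tokens.
toy_df = pd.DataFrame({"text": [["flight", "late"], ["great", "crew"]]})

# " ".join is applied to each list, turning it back into a single string.
toy_df["text"] = toy_df["text"].apply(" ".join)
print(toy_df["text"].tolist())  # ['flight late', 'great crew']
```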

Observations

5. Vectorization

Observations

First, I want to see how common each word is so that I know what limits to try for the max_features hyperparameter.
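
A word-frequency check like this can be sketched with `collections.Counter`. The toy tweets are invented, and the "more than once" cutoff is just one example of a threshold that might inform a max_features limit.

```python
from collections import Counter

# Toy cleaned tweets, one string per row.
toy_tweets = ["flight late", "flight cancelled", "great crew", "late again"]

word_counts = Counter(word for tweet in toy_tweets for word in tweet.split())

vocab_size = len(word_counts)
words_seen_more_than_once = sum(1 for count in word_counts.values() if count > 1)

print(word_counts.most_common(2))
print(vocab_size, words_seen_more_than_once)  # 6 2
```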

Observations

CountVectorizer

Observations

The new array has 14,640 rows (1 for each tweet) and 11,626 columns (1 for each word)

TF-IDF Vectorizer

Observations

As with the CountVectorizer array, the TF-IDF array has 14,640 rows (1 for each tweet), but it has only 1,000 columns (1 for each word) because I limited max_features to 1,000.

6. Modeling, Tuning, and Evaluation

Tasks (7+7 Marks):

- Fit the model using the vectorized column
- Tune the model to improve the accuracy
- Evaluate the model using the confusion matrix (on both types of vectorization)
- Print the top 40 features and plot their word cloud using both types of vectorization

It is not mandatory to encode the target column before training the model; you can proceed with modeling without encoding.

You can apply any classification algorithm which has been covered in the course so far

Fit the CountVectorizer Model

Observations

Data has been split properly and is ready to be fit

Observations

The default random forest model with CountVectorizer has a mean accuracy score of around 72%
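
A baseline of this kind can be sketched as below. The toy texts and labels are invented, and the split size and `random_state` values are assumptions, so the score printed here says nothing about the 72% figure.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

# Toy corpus, invented for illustration.
texts = ["flight late", "great crew", "lost bag", "love service",
         "flight cancelled", "thank great flight", "rude agent", "awesome crew thank"]
labels = ["negative", "positive"] * 4

# Vectorize first, then split, mirroring the order used in this notebook.
X = CountVectorizer().fit_transform(texts)
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.25, random_state=1, stratify=labels
)

forest = RandomForestClassifier(random_state=1)  # default hyperparameters
forest.fit(X_train, y_train)
accuracy = forest.score(X_test, y_test)  # mean accuracy on the held-out split
print(accuracy)
```

Fitting the vectorizer on the training split only would avoid leaking test-set vocabulary into the features. That is a design choice to weigh against the simpler order shown here.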

Tune the CountVectorizer Model

Observations

Even with RandomizedSearchCV, it took about 7 minutes to find the optimal parameters. That's longer than in our previous projects, where we were dealing with numerical data.
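
For reference, a randomized search over a random forest can be sketched as follows. The parameter grid, `n_iter`, and `cv` values are illustrative assumptions, not the settings used in this notebook, and the toy corpus is invented.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import RandomizedSearchCV

texts = ["flight late", "great crew", "lost bag", "love service",
         "flight cancelled", "thank great flight", "rude agent", "awesome crew thank"]
labels = ["negative", "positive"] * 4
X = CountVectorizer().fit_transform(texts)

# Hypothetical search space; only n_iter combinations are sampled from it.
param_distributions = {
    "n_estimators": [10, 25, 50],
    "max_depth": [None, 3, 5],
    "min_samples_leaf": [1, 2],
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=1),
    param_distributions=param_distributions,
    n_iter=4,
    cv=2,
    scoring="accuracy",
    random_state=1,
)
search.fit(X, labels)
print(search.best_params_, search.best_score_)
```

Run time grows roughly with `n_iter` times the number of folds, and text features add thousands of columns per fit, which is consistent with the longer search noted above.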

Evaluate the CountVectorizer Model
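
As a small worked example of reading a confusion matrix for the three sentiment classes, the sketch below uses invented true and predicted labels. In scikit-learn's `confusion_matrix`, rows are the true classes and columns are the predicted classes, in the order given by `labels`.

```python
from sklearn.metrics import confusion_matrix

classes = ["negative", "neutral", "positive"]

# Invented labels for illustration.
y_true = ["negative", "negative", "neutral", "positive", "positive", "negative"]
y_pred = ["negative", "neutral", "neutral", "positive", "negative", "negative"]

cm = confusion_matrix(y_true, y_pred, labels=classes)
print(cm)
# [[2 1 0]
#  [0 1 0]
#  [1 0 1]]

# Accuracy is the diagonal (correct predictions) over the total.
accuracy = cm.trace() / cm.sum()
```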

Observations

Top 40 Features with CountVectorizer & Random Forest Model

Observations

Word Cloud with CountVectorizer & Random Forest Model

Observations

Fit the TF-IDF Model

Observations

Data has been split properly and is ready to be fit

Observations

Tune the TF-IDF Model

Observations

Evaluate the TF-IDF Model

Observations

Top 40 Features with TF-IDF

Observations

Word Cloud with TF-IDF

Observations

7. Conclusion